ABSTRACT

Peer evaluation research was reviewed from three
validity in a military training situation. Evaluation criteria
included leadership potential, promotion potential, personality
traits, and supervisory skills. Substantial validity was generally
found, with correlation coefficients in the .30 to .50 range.
Reliability and validity of different evaluation methods (rating,
ranking, nominations, and combinations of these techniques) did not
vary substantially. Evaluation methods did, however, differ in
feasibility and acceptability, the latter largely a function of
familiarity with the evaluation procedure and perceived difficulty.
Situational factors have documented and potential effects which
developers and users of peer evaluation should recognize. These
factors include group size, informal group structure, demographic
characteristics, group boundaries, hierarchical characteristics,
friendship, length of association, and types of interaction. Although
many issues surrounding peer evaluation or associate evaluation
remain unresolved, evidence suggests that these evaluations are a
powerful tool in discriminating complex human behavior. (Author/CE)
FOREWORD



This research, carried out within the Personnel Accession and
Utilization Technical Area of the Army Research Institute (ARI),
includes a representative review of previous findings, both within the
Army and otherwise, on the validity and reliability of peer evaluations.
The research also reviews several situational or contextual factors that
should be considered in conducting peer evaluations.

This research is an in-house effort and is responsive to Army Project
2Q162717A766 and to special requirements of the Office of Deputy Chief
of Staff for Personnel.
REVIEW OF PEER EVALUATION RESEARCH

BRIEF

Requirement:

To review previous findings on the validity and reliability of peer
evaluations as well as various situational moderators.

Procedure:

Peer evaluation research was reviewed from the four major
perspectives of evaluation process, methodology, situational factors,
and validity studies.

Findings:

Studies investigating the structure and nature of the peer
evaluation process have generally found fairly clear factor structure
across widely varying samples. There is some evidence that the structure
may be as much in the nature of the rater as the ratee. A review of
findings from research that utilized different methods indicated little
evidence for substantial differences, in either reliability or validity,
among techniques. Further, a review of the documented and potential
effects of situational factors impacting on the evaluation process
indicated that users of peer evaluation should be aware of these issues
in designing programs. Research generally has found substantial
concurrent and predictive validity, with correlations in the .30 to .50
range, but with most studies limited to training groups.

Utilization of Findings:

Several issues surrounding peer evaluations remain unresolved;
however, evidence suggests that these issues can be resolved, and that
peer evaluations are a powerful tool in discriminating complex human
behavior.
INTRODUCTION

When confronted with the prospect of drawing order out of complex
human behavior in the equally complex world of work, much traditional
behavioral science research has been marked by two primary
characteristics. First, heavy reliance has been placed upon human
evaluations of other human beings. Second, this evaluative information
has been typically gathered from a limited observational viewpoint, that
of a superior toward a subordinate. The technique presented in this
paper does not deviate from the first of these characteristics; it does
rely on human evaluation of other human beings. However, it goes beyond
the second characteristic by gathering evaluative information from the
perspective of an individual's peers. For purposes of this paper, peers
are operationally defined thus: (a) they have some common purpose or
frame of reference (e.g., members of the same work group), and (b)
generally speaking, they lack a formally recognized authority
relationship between them. Although the term "peer rating" is most
commonly applied to this technique, the present paper uses the more
generic term "evaluation," reserving the term "rating" for one
particular technique.

A source of much confusion in peer evaluation research has been a 
lack of clarity between the technique and the dimension or characteris- 
tic evaluated. Although previous work reviewed here substantially sup- 
ports use of peer evaluation as a technique, issues surrounding the 
particular dimensions evaluated are not discussed in this review. 

This paper contains three relatively complementary sections. First, 
a representative selection of typical validity research is reviewed, 
along with a brief history of the use of peer evaluations. The second 
section discusses various methodological issues underlying the peer eval- 
uation technique, and the third section presents several situational or 
contextual factors that can affect a peer evaluation effort. 



VALIDITY OF PEER EVALUATIONS 

The history of the peer evaluation technique can be traced from the
seminal work of Moreno (1934) and the development of the sociogram
technique. However, the history of the technique as it is dealt with
here is more conveniently traced to several efforts conducted during and
after World War II (see, for example, Clarke, 1946; U.S. Army Research
Institute, 1943; Wherry, 1945). One of the earliest investigations
published in the professional literature is that by Williams and Leavitt
(1947). Since that time, peer evaluations have been used for two primary
purposes. The first of these purposes is evaluative in the criterion
sense; the concern is in judging the extent or adequacy of some
individual characteristic (e.g., leadership effectiveness, job
performance). The second purpose is evaluative in the sense of gaining
information with which to predict some future outcome (individual
potential, motivation to work, etc.). Both purposes have guided the
efforts in research as well as operational settings, although typically
only one purpose has been the focus in any given situation.

Table 1 summarizes the results and major characteristics of a
representative sampling of studies which report validity information for
peer evaluations. This overview is intentionally not exhaustive, since
several other more specialized reviews are available elsewhere (e.g.,
Gibb, 1969; Hollander, 1954a; Boulger & Coleman, 1964; & Nadal, 1968).
Lindzey and Byrne (1968) have also presented an excellent review of the
use of social choice methodology, of which peer evaluations are one type.

There are several noteworthy features in Table 1. First, the
magnitude of the validity coefficients is generally strong in both
concurrent and predictive studies. Peer evaluations have shown rather
strong predictive ability even for periods up to 5 years (Hollander,
1965). Furthermore, in those studies that included measures in addition
to peer evaluations, the peer evaluations tended to have the highest
concurrent or predictive validity.

Also, the majority of the evidence for the value of peer evaluations
has been gathered in a training situation, particularly in the military
environment. In fact, only two of the studies in Table 1 (Weitz, 1958;
Downey, Medland, & Yates, 1976) used a sample from other than a training
or educational environment. With a few exceptions, most evidence has
been gained from people relatively low in the hierarchy of their
organizational setting.

A third major feature of Table 1 is the variety of dimensions that
peers have been required to evaluate and the variety of criteria with
which peer evaluations have been related. The peer evaluation
dimensions have included leadership potential, personality traits, and
supervisory skill, to name but a few.
Table 1

Some Representative Studies on the Validity of Peer Evaluations

Investigators



Amir, Kovarsky, & 
Sharan (1970) 



Berkshire & 
Nelson (1958) 

Butler (1974) 



Doll (1963) 



Downey (1973) 



Downey, Medland, 
& Yates (1976) 

Haggerty (1963) 



Hollander (1950 
Type of subject Dimensions evaluated



Criteria 



Correlation 



Enlisted military 
trainees 

NCO trainees 

Military officer 
trainees 

West Point 
trainees 

Military officer 
trainees 

Military cadets 

Senior military 
officer trainees 

Senior military 
officers 

West Point 
trainees 



Military officer 
trainees 



Promotion potential Promotion to NCO 



Promotion potential 
Promotion potential 

Leadership 



Promising cadets 
Promotion potential 



Leadership traits 

Leadership traits 
Leadership 



Promotion to officer 
c 

Graduation 
Performance 

Performance^ 
Promotion^ 



Promising officers Pass/fail 



Pass/fail 



Promotion 



Promotion potential Promotion^ 



Performance 

Performance"^ 
c 

Graduation 



.44** (1,979) 



.63** (1,918)



(1,152) 
(1,152) 



.38** (547) 
.24** (547) 

.20** (606) 



.36** (660) 

"^** (246) 

.53** (242)

.38** (120)

.26** (253)

.27** (268) 
Table 1 (continued)



Investigators 



Type of subject Dimensions evaluated 



Criteria 



Correlation 



Hollander (1965) 



Klieger, deJung,
& Dubuisson (1962)

Kraut (1975) 



Kubany (1957) 



Levi, Torrance,
& Pletts (1958)

Peterson, Lane, 
& Ambler (1966) 

Ricciuti (1955) 



Roadman (1964) 
Military officer
trainees

Enlisted military 
trainees 

Manager trainees 



Executive trainees 



Medical students 



Enlisted military 
trainees 

Military officer 
trainees 

Military officer 
trainees 



Management trainees 



Leadership 

Performance potential Discharge 



Grades 
Performance 

a 



Impact — 10 scales Promotion 

a 

Tactfulness — 3 scales Promotion 
Impact — 10 scales Performance^ 
Tactfulness — 3 scales 



Medical performance 
potential 

13 dimensions of per- 
sonality & potential 

Carefulness 



Leadership 



Performance 
Instructor 

c 

evaluations 

Dropout rate*" 
Performance^ 

Pass/fail 



Performance as 
midshipmen^ 
Performance 
training cruise 



a 

13 dimensions of per- Promotion
sonality, achievement,
& leadership



.51** (229)
.37** (229)

.42** (1,571)



.31** (82)
.02 (82)
.35** (83)
.37** (83)
.48** (87)



-"** (770) 
** (770) 



.22** (462)
.32** (324)
.26** (324)



** 



(56) 
Table 1 (continued)

Investigators              Type of subject            Dimensions evaluated     Criteria        Correlation

Smith (1967)               College students           Extraversion             GPA[c]          .05   (348)
                                                      Strength of character    GPA[c]          .43** (348)
Tupes                      Military officer trainees  Composite of 30          Performance     .51** (615)
                                                        personality factors    Grades          .31** (615)
Waters & Waters (1970)     Sales trainees             Agreeable                Performance    -.27*  (53)
                                                      Sales potential          Performance     .31*  (53)
Weitz (1958)               Salesmen                   Promotion potential      Performance     .40** (100)
Wherry & Fryer             Military officer           Leadership               Retention[c]    .70** (134)
                                                                               Graduation[c]   .49** (134)
Wiggins, Blackburn,        College graduate students  Academic success         GPA             .56** (46)
  & Hackman (1969)                                    Academic success                         .69** (58)
Williams & Leavitt (1947)  Military officer trainees  Future potential         Performance[a]  .47** (100)
Willingham (1958)          Military officer trainees  17 leadership traits     Pass/fail[c]    .28** (994)

[a] Predictive criterion.
[b] Numbers in parentheses are number of subjects.
[c] Concurrent criterion.
[d] Significant group differences found.
*p < .05.
**p < .01.
Attempts to implement peer evaluation programs have produced an
impressive array of findings. However, several limitations also appear.
For instance, there is only minimal evidence of the validity of peer
evaluations among individuals at organizationally higher levels. There
is also a limited, but growing, amount of evidence of the utility of peer
evaluations in other than the training environment. In addition, in
studies that use peer evaluations as a predictor of a concurrent or
future criterion, virtually all the validity evidence is of a bivariate
variety. Although a number of studies demonstrated that peer
evaluations are often the best single predictor from among several
predictors, no research was found that attempted to determine what other
predictors might account for unique variance along with peer
evaluations. An exception to this preoccupation with the bivariate
paradigm is occasionally found in assessment center methodology.
Mackinnon (1975) has elsewhere presented a comprehensive review of
assessment centers, but even in assessment centers with a wealth of
information available, the differential validity of peer evaluations has
not always been adequately addressed.



METHODOLOGICAL ISSUES

Peer evaluations have been performed by means of four primary
techniques: ratings, rankings, full nominations, and high nominations.
The general paradigm of the rating technique calls for a group member to
provide a rating of the relative amount or degree of the dimension under
consideration possessed by every other group member. The ranking
procedure simply requires each group member to rank-order all other
group members from high to low (or some other relevant continuum) on the
dimension under consideration. The full nomination technique requires
that each group member choose a specified number or proportion of the
group as being either high, medium, or low on a given dimension. The
minor variation of this technique in which nominations of the middle are
not required is also referred to as full nominations. However, the case
in which only high nominations are elicited is reserved as a
discriminably different technique, for reasons to be elaborated upon in
later portions of the paper.

Several variations based on combinations of these basic techniques
are forced distribution rankings, or combinations of rankings with
ratings. General scoring algorithms for the four primary techniques
follow.



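Ratings:

\[ \text{Score} = \frac{\sum R}{N} \]

Rankings:

\[ \text{Score} = 100 - \frac{\sum Rk}{N} \]

Full Nominations:

\[ \text{Score} = \frac{\sum (1 r_1) + \sum (2 r_2) + \sum (3 r_3)}{N} \]

High Nominations:

\[ \text{Score} = \frac{\sum r_3}{N} \]

where

    R   = rating,
    Rk  = ranking,
    r_1 = low nomination,
    r_2 = mid (or no) nomination,
    r_3 = high nomination,
    N   = number giving an evaluation, and
    T   = total number in the group.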



All these techniques produce scores with means independent of group
size, with the exception of the ranking formula, in which case adjustment
must be made for group sizes greater than 100. The standard deviation of
the various scores is a function of the reliability (consistency) of each
group's evaluations; Gordon (1969) and Willingham (1959) deal with
general issues related to reliability. Also, for a group using either a
ranking or nomination technique, the average score is determined; the
average score using the rating technique is free to vary.
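
The four scoring rules can be illustrated with a minimal sketch. It assumes that a rating score is the mean rating received, that a ranking score is a constant (100) minus the mean rank received, that a full nomination score weights low, mid (or no), and high nominations as 1, 2, and 3, and that a high nomination score is the share of evaluators giving a high nomination. The ranking form in particular is an assumption, and all function and argument names are hypothetical.

```python
def rating_score(ratings):
    """Mean rating received: sum of ratings divided by number of raters."""
    return sum(ratings) / len(ratings)


def ranking_score(ranks, ceiling=100):
    """Assumed form: ceiling minus the mean rank received (rank 1 = highest)."""
    return ceiling - sum(ranks) / len(ranks)


def full_nomination_score(low, mid, high):
    """Weighted mean with low = 1, mid (or no nomination) = 2, high = 3.

    low, mid, and high are counts of evaluators in each category.
    """
    n = low + mid + high
    return (1 * low + 2 * mid + 3 * high) / n


def high_nomination_score(high, n_evaluators):
    """Share of evaluators who gave this person a high nomination."""
    return high / n_evaluators


print(rating_score([4, 5, 3, 4]))                    # 4.0
print(ranking_score([1, 3, 2, 2]))                   # 98.0
print(full_nomination_score(low=1, mid=5, high=4))   # 2.3
print(high_nomination_score(high=4, n_evaluators=10))  # 0.4
```

Under the assumed ranking form, a group larger than 100 would yield negative scores for the lowest ranks unless the ceiling is raised, which is one reading of the adjustment mentioned above.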



Metric and Distribution 

The metric and distributional properties of associate evaluations
are directly related to the particular technique employed. With respect
to scaling properties, the rankings and both nomination procedures
produce an ordinal scale (Stevens, 1951). The ratings from an evaluator
are the most nearly equal interval data, although here also it can be
argued that these are merely an ordinal scale. The scaling properties
of the summated scores from the various techniques approximate interval
data as the number in the evaluation group increases.
The four common procedures will generally produce different
distributions, examples of which are displayed in Figure 1. Given the
relatively free response mode, ratings will often produce negatively
skewed distributions largely because group norms tend to inflate any
evaluative procedure. The ranking procedure, if it were perfectly
reliable, would produce a rectangular distribution with one person at
each rank. Generally, less than perfectly reliable rank scores will tend
to be normally distributed, with very unreliable scores producing a more
leptokurtic curve, and a perfectly unreliable procedure producing a
point distribution with everyone receiving an average rank equal to the
middle rank. Full nomination scores produce a distribution which, if
perfectly reliable, is trimodal, with one group receiving all high
nominations, another group all low nominations, and the remainder middle
nominations or none at all. High nominations produce a bimodal
distribution (not shown in Figure 1).
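
The behavior of ranking scores at the two extremes of reliability can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reconstruction of Figure 1: each rater is assumed to perceive a member's true standing plus Gaussian noise, raters are treated as external to the ranked group for simplicity, and the function name and parameters are hypothetical.

```python
import random
from statistics import pstdev


def mean_ranks(n_members, n_raters, noise, seed=0):
    """Mean rank received by each member when every rater ranks all members.

    noise = 0 gives perfect agreement among raters; a very large noise
    value makes each ranking effectively random.
    """
    rng = random.Random(seed)
    true_standing = list(range(n_members))  # member 0 is best
    totals = [0] * n_members
    for _ in range(n_raters):
        perceived = [t + rng.gauss(0, noise) for t in true_standing]
        order = sorted(range(n_members), key=lambda i: perceived[i])
        for rank, member in enumerate(order, start=1):
            totals[member] += rank
    return [t / n_raters for t in totals]


perfect = mean_ranks(10, 200, noise=0.0)
random_like = mean_ranks(10, 200, noise=1000.0)

print(perfect)              # one member at each rank: 1.0, 2.0, ..., 10.0
print(pstdev(perfect))      # spread of a rectangular distribution
print(pstdev(random_like))  # much smaller spread around the middle rank of 5.5
```

With perfect agreement the mean ranks are spread evenly over the ranks; as agreement disappears, every member's mean rank collapses toward the middle rank.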

Basis of Comparison 

Scores resulting from the four primary techniques vary along another
important dimension: the evaluative process evoked in the evaluator upon
which judgments are made. Drucker (1957) initially pointed out the
duality of focus with which peer evaluations can be executed: whether
the frame of reference or standard upon which the evaluations are made
is internal or external to the group. In one case, the evaluator
compares the particular individual against a frame of reference external
to the group and assigns the individual to a category. In the second
case, the evaluator compares the particular individual against a frame
of reference internal to the group and makes a judgment of more or less,
and assigns the individual to the appropriate category. The external
process can be used only with the rating procedure. The internal process
can also be used with ratings; with rankings and nominations, it is
required. The internal process, in general, requires a moderate number
of individuals in the group (more than five). The direct implication of
this distinction is that the external frame of reference allows both
comparison between individuals across peer groups and the comparison of
peer groups. The internal process does not allow comparison between
individuals across peer groups unless the assumption is accepted that
the groups are equal on the particular ability, trait, or behavior.

A corollary of this implication is that population norms can be
developed only through the use of a rating procedure and an external
frame of reference, again unless group equality is assumed or assured.

Reliability 

The reliability of associate evaluations has generally been
determined by one of two methods, estimation of internal consistency or
test-retest correlation. Both methods are analogous to the same
procedures in classical test theory (Lord & Novick, 1968).
The internal consistency of peer evaluations is the degree to which
members of a peer group agree with one another when observing an
individual in a similar situation and at the same time. Using the
multiple-choice test paradigm, the evaluators are comparable to test
items and those who are being evaluated are comparable to persons taking
the test. Although Gordon (1969) has recommended the use of the alpha
coefficient for estimating the internal consistency or reliability of
peer evaluations, the most common procedure has been a split-half (or
group) estimate. The split-half estimate is made by randomly assigning
peer group members to one of two groups, computing scores in each group
for all group members, and then correlating the scores for each ratee
from each group (see Hollander, 1957, & Downey, 1974). The correlation
coefficient is then adjusted for total group size using the
Spearman-Brown formula (Gulliksen, 1950). If small groups are used, a
random split may not be possible, and some technique for averaging the
intercorrelations between evaluators could be used (Gulliksen, 1950).
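
The split-half procedure just described can be sketched as follows. This is a minimal illustration rather than the procedure of any cited study: the data layout (one row of scores per evaluator, with None where no evaluation was given, such as for oneself), the alternating assignment after a random shuffle, and all function names are assumptions.

```python
import random
from statistics import mean


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_brown(r, factor=2.0):
    """Step a reliability up to a group `factor` times as large."""
    return factor * r / (1 + (factor - 1) * r)


def split_half_reliability(scores, seed=0):
    """scores[rater][ratee] holds the evaluation given, or None if none.

    Raters are randomly split into two halves, each ratee's mean score is
    computed within each half, and the two sets of means are correlated.
    Returns the half-group correlation and its Spearman-Brown adjustment.
    """
    rng = random.Random(seed)
    raters = list(range(len(scores)))
    rng.shuffle(raters)
    half_a, half_b = raters[::2], raters[1::2]
    n_ratees = len(scores[0])

    def half_means(half):
        return [
            mean(scores[i][j] for i in half if scores[i][j] is not None)
            for j in range(n_ratees)
        ]

    r_half = pearson(half_means(half_a), half_means(half_b))
    return r_half, spearman_brown(r_half)


# Hypothetical data: six members who largely agree on one another's standing.
scores = [
    [None if i == j else j + ((i * j) % 3) * 0.1 for j in range(6)]
    for i in range(6)
]
r_half, r_full = split_half_reliability(scores)
print(round(r_half, 3), round(r_full, 3))
```

The adjustment with a factor of 2 reflects that the full group contains twice as many evaluators as each half.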

The test-retest method of estimating reliability requires that
group members evaluate each other at two different times. Scores from
the two different evaluations are then correlated. Examples of this
type of estimate are given in Hollander (1957) and Downey (1974, 1976).
Perhaps the most rigorous examination of reliability was done by Gordon
and Medland (1965), in which they varied both time of administration and
group doing the evaluations and found reliability coefficients in the
.80's.

Research has generally demonstrated the reliability of peer
evaluations to be in the .70 to .90 range, regardless of the type of
reliability estimate employed. Research comparing the various evaluative
methodologies is rare but has generally supported the view that all four
methods are quite similar, with perhaps a slight advantage to ratings
(Suci, Vallance, & Glickman, 1954; Downey, 1974; Hammer, 1963). Even the
use of a paired comparison procedure does not significantly improve
reliability (Bolton, 1971).



Acceptability 

A major factor in the success or failure of any peer evaluation
procedure, whether for operational or research purposes, is the degree
to which participants accept the purpose of the evaluations.
Acceptability is generally studied as a specific issue of the particular
program under investigation rather than through comparative analyses of
acceptability across techniques or situations. There is therefore little
formal evidence of differences between techniques in this respect, but
inferences can be drawn from the particular qualities of the technique.
A major factor in the acceptability of a technique is the degree of
perceived difficulty. From this point of view, both the rating and
ranking of large numbers of individuals (more than 20) can be
time-consuming and make for difficult discriminations, particularly
among group members who are more or less average on the particular
dimension. On the other hand, the nomination procedure allows the
individual to place a large number of people in a desired category and
does not require such difficult discriminations.

The rating procedure is quite acceptable to the raters where the
rated group is small and cohesive. The full nomination technique is
acceptable to the nominators for moderate-size to large groups in which
not all individuals are well known to one another. The high nomination
technique is even more acceptable because it does not require an
individual to make negative evaluations.

Another determinant of the degree of acceptability is the degree to
which group members are knowledgeable about the evaluation procedure,
process, background, and use. Downey (1975) found that acceptability
improved as a function of an educational program. Two different
considerations were noted: (a) the degree to which peer evaluations were
felt to be valuable and accurate estimates and (b) the degree to which
the evaluations were acceptable for particular uses. Downey also found
that a person's peer evaluation score and degree of acceptance of the
peer evaluation process were positively correlated; larger correlations
were found in the group who knew less about the peer evaluation process.

Feasibility 

Closely linked with the concept of acceptability is feasibility,
or costs associated with the implementation and execution of a
particular peer evaluation system. The major costs associated with a
peer evaluation system are (a) preparation of evaluation materials, (b)
administration time, and (c) scoring cost. Prior to the advent of
automatic data processing procedures, the costs associated with use of
any peer evaluation system in large groups or on a large scale were
prohibitive. Merely in terms of bits of information collected, it can be
seen that the number of evaluations is typically equal to n(n - 1),
where n is the number in the group. Thus, peer evaluation systems are
relatively costly efforts, which typically require more than minimal
sophistication with data processing procedures. Unfortunately, little
systematic information on cost is available.
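
A short calculation shows how quickly the n(n - 1) count grows. The group sizes below are chosen only for illustration, and the function name is hypothetical.

```python
def evaluations_required(n):
    """Each of n group members evaluates the other n - 1 members."""
    return n * (n - 1)


for n in (10, 40, 321):
    print(n, evaluations_required(n))
# 10 90
# 40 1560
# 321 102720
```

A fourfold increase in group size, from 10 to 40, raises the number of individual evaluations roughly seventeenfold.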



SITUATIONAL FACTORS

In addition to the methodological concerns of the various techniques,
several situational or contextual factors can affect a peer evaluation
system, often without regard to the specific technique under discussion.
These factors include group size, informal group structures, demographic
characteristics, group boundaries, hierarchical characteristics,
friendships, length of association, and types of interaction.



Group Size 

Very few attempts have been made to study the independent effects
of group size. More often than not, what evidence there is has been
reported as a byproduct in research directed elsewhere. For example,
Downey, Medland, and Yates (1976) used a peer nomination technique with
groups of Army colonels in 14 career groups that varied in size from 22
to 321. Reliability coefficients varied from .63 to .94, and the rank
order coefficient between group size and reliability was .03. Downey
(1976), in a sample of Army Rangers, compared peer ratings collected
within squads (n = 10) with peer nominations collected on the same men
within platoons (n = 40). Coefficients between the two scores were in
the .60's. However, platoon scores were both more reliable and more
predictive of job performance.

As mentioned previously, from the standpoint of feasibility both
ratings and rankings would seem to be most appropriate for relatively
small group sizes (approximately a dozen), whereas the nomination
technique is virtually mandatory for large groups (more than 50). From
the standpoint of empirical results, it appears that small groups may
produce somewhat unreliable scores, with reduced validity.
Alternatively, although it is rational to believe that there is an
optimal upper size of peer group, scant evidence exists to support this
view.



Informal Group Structures 

Within any formally defined group, there may exist one or more
informal subgroups defined by some sort of mutual self-interest. The
issue then arises as to the effect these informal subgroups may have on
a peer evaluation procedure conducted in the total group.

The worst case would be one in which two equal-sized informal sub- 
groups existed within a total group, and each group member was exclu- 
sively in one subgroup or the other. In such a situation, one or both 
subgroups might make their evaluations solely on the basis of subgroup 
membership, i.e., on a basis other than the one intended. The net ef- 
fect of such behavior is to attenuate the validity of the peer evalua- 
tion procedure; attenuation is most pronounced when both subgroups engage 
in such behavior. The effect diminishes if one of the groups does, in 
fact, provide evaluations over the whole group on the dimension intended. 
The effect also diminishes as informal subgroup size decreases or as the 
number of subgroups increases.
In terms of technique, the effect of subgroup behavior is pronounced
if ratings or rankings are used. Resultant scores are most likely to be
negatively skewed. The use of full nominations will tend to produce
scores with decreased variance, and high nominations will produce the
worst case with a drastic reduction in variance. An important point when
using nominations is that the use of too many nominations relative to
total group size may increase the effect of subgroup behavior (see
Downey, 1974).

It is clear that subgroups of sufficient size can have an effect 
upon the final scores. The problem is the incidence of such effects and 
whether there exists a mechanism for detecting them. If the evaluation 
process is part of an ongoing process, the simplest procedure for checking 
for these problems is the repetitive production of reliability indices 
as part of the procedure for producing peer scores. If the reliability 
coefficients were to drop below .60, it would probably indicate a
problem, and care should be taken in use of the evaluations.
Alternatively, a two-way analysis of variance design, one factor being
the type of raters and the other factor being the same type of ratees,
could be used.
If a significant interaction were found, then a strong case could be made 
for considering the peer scores as at least partially the result of group 
membership. 
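
Both checks can be sketched in simple form. The example below flags administrations whose reliability falls below the .60 level mentioned above and computes the rater-subgroup by ratee-subgroup cell means for a two-subgroup case. It stops short of a formal analysis of variance: the interaction contrast is descriptive only, no significance test is performed, and the subgroup labels, data, and function names are all hypothetical.

```python
from statistics import mean


def flag_low_reliability(reliabilities, threshold=0.60):
    """Return the administrations whose reliability fell below the threshold."""
    return [label for label, r in reliabilities.items() if r < threshold]


def cell_means(evaluations):
    """evaluations: list of (rater_group, ratee_group, score) tuples."""
    cells = {}
    for rater_group, ratee_group, score in evaluations:
        cells.setdefault((rater_group, ratee_group), []).append(score)
    return {key: mean(vals) for key, vals in cells.items()}


def interaction_contrast(means, a="A", b="B"):
    """Difference of differences for a 2 x 2 rater-by-ratee layout.

    Near zero: both subgroups of raters order the subgroups alike.
    Large and positive: each subgroup favors its own members.
    """
    return (means[(a, a)] - means[(a, b)]) - (means[(b, a)] - means[(b, b)])


# Hypothetical scores in which each subgroup rates its own members higher.
evaluations = [
    ("A", "A", 5), ("A", "A", 4), ("A", "B", 2), ("A", "B", 3),
    ("B", "A", 2), ("B", "A", 2), ("B", "B", 5), ("B", "B", 4),
]
print(interaction_contrast(cell_means(evaluations)))               # 4.5
print(flag_low_reliability({"week 1": 0.82, "week 2": 0.55}))      # ['week 2']
```

A large contrast would suggest examining the interaction term in a full two-way analysis of variance before using the scores.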

Demographic Characteristics 

The use of peer evaluations with their reliance upon fallible human
observers immediately raises the possibility of racial and sexual bias
on the part of evaluators. This concern is especially crucial in view
of recent problems associated with demonstrating the absence of bias in
employment selection and classification measures as well as in criterion
measures.



The evidence concerning racial bias in peer evaluations is mixed and
inconclusive. In a study dealing with Air Force recruits, Cox and
Krumboltz (1958) found that subjects were rated higher by members of
their own race, but the effect varied across groups, and there was
substantial agreement on rank order across races (r = .76). They
concluded that any bias was far from complete and suggested that prior
acquaintanceship of group members might account for the differences. In
a similar study in the Army, deJung and Kaplan (1962) found similar
results: Ratings differed as a function of the rater's race. However, an
analysis of covariance adjusting for a combined interest and math score
showed that whites did not give higher adjusted scores to whites or
blacks, but that blacks gave higher adjusted scores to blacks. Results
were interpreted in terms of assignment of higher scores to close
acquaintances, a result that had most impact upon blacks rating blacks
(because of the smaller group size).
In a more recent study in an industrial training context, Schmidt
and Johnson (1971) used a forced-choice rating distribution in groups
made up of approximately equal numbers of blacks and whites. No
differences due to race were found.

The evidence suggests that peer evaluations can be subject to racial
bias, but the effect is perhaps more strongly related to the interaction
between friendship or acquaintanceship and the particular evaluation
method used than to the fact of race itself. The presence of
substantial correlation between the rank orderings from each race
indicates that the ordering was not much affected by race. But the use
of ratings allows evaluators to assign unrelated scores to individuals
whom they consider special in some way.

In terms of sexual bias, Mohr and Downey (1977) recently reported
results from a small sample of Army officers, in which females scored
lower than males on evaluations received from both males and females.
If bias occurred, it was on the part of both groups. An interesting
finding was that females' self-ratings were not related to either male
or female evaluations, but males' self-ratings were related to these
evaluations.

This admittedly small number of studies appears to indicate that 
differences based upon race and sex can occur, but does not make clear 
whether these differences are attributable to race or sex group
differences, to interaction patterns (e.g., friendships), to the specific
methodology, or to some combinations of these factors. It would cer- 
tainly be safe to say that researchers should be sensitive to the poten- 
tial for such bias. 



Group Boundaries 

The discussion of peer evaluations has proceeded to this point as 
if it were clear just what is meant by a peer or associate group. Most 
researchers report their procedures in sufficient detail to show the
general characteristics of the groups in the study. However, given the 
variety of overlapping and higher order groups in most real-life settings, 
the issue becomes that of defining some basic guidelines for selecting 
the appropriate rating group. It is clear that the selection of the 
evaluative group can be affected by such factors as length and type of 
interaction, formal organizational structure, informal group structure, 
friendship patterns, and, of course, the particular dimension being 
evaluated. 

There are few empirical findings to guide selection of the peer
group. Rather, guidelines must be best guesses based on partial
information from related data.

In a 1976 study, Downey found that platoon evaluations produced
more reliable and slightly more valid scores than did squad evaluations,
but the differences were potentially confounded by differences in method
and group size. Gordon and Medland's 1965 study, in which individuals
were evaluated at two different times by totally different groups,
indicated a high degree of stability across the two evaluations. Even the
method used to compute reliability indices, random splits of the primary 
group, supported the notion that group composition can be drastically 
altered without giving rise to major problems in the reliability and 
validity of scores. 
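
As an illustration of the random-split approach mentioned above, the
following is a minimal sketch and not the procedure used in any cited
study. It assumes a hypothetical layout in which ratings[r][p] is the
score rater r gave person p, with raters and ratees treated as separate
sets for simplicity. The rater group is split at random into halves, each
half produces a mean score per person, and the two sets of scores are
correlated; the result is averaged over many splits.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def split_half_reliability(ratings, n_splits=200, seed=0):
    """Average correlation between scores from random halves of the raters.

    ratings[r][p] is the score rater r gave person p (hypothetical layout).
    """
    rng = random.Random(seed)
    n_raters = len(ratings)
    n_people = len(ratings[0])
    estimates = []
    for _ in range(n_splits):
        raters = list(range(n_raters))
        rng.shuffle(raters)
        half_a = raters[: n_raters // 2]
        half_b = raters[n_raters // 2 :]
        score_a = [mean(ratings[r][p] for r in half_a) for p in range(n_people)]
        score_b = [mean(ratings[r][p] for r in half_b) for p in range(n_people)]
        estimates.append(pearson(score_a, score_b))
    return mean(estimates)

# Hypothetical data: six people with underlying merit 1..6, rated by eight
# raters whose scores deviate from merit by a small fixed pattern.
merit = [1, 2, 3, 4, 5, 6]
ratings = [
    [merit[p] + 0.2 * ((r * 7 + p * 3) % 5 - 2) for p in range(len(merit))]
    for r in range(8)
]

reliability = split_half_reliability(ratings)
```

A high value means that two arbitrary halves of the group order the same
individuals in much the same way, which is the sense in which group
composition can be altered without disturbing the scores.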



Hierarchical Characteristics 

A concept related to that of group boundaries is that of hierarchies. 
Suppose one were to perform a peer evaluation procedure in a traditionally 
hierarchical organization. If work groups at the subordinate level are 
chosen as the peer groups, what effect does inclusion of their immediate 
superiors have on the resulting evaluations? Conventional wisdom tends 
to hold that inclusion of such individuals can contaminate the procedure,
and therefore they should be excluded from the worker peer groups and
included in a peer group of first-level supervisors.

Again, results bearing upon hierarchical inclusion are mixed. Re- 
search by Levi, Torrance, and Pletts (1958) indicated no effects from 
including the formal leader in the peer evaluation process. Research 
by Downey in 1975, in which the leaders of small combat units were in- 
cluded in the peer nomination process, indicated that the leaders spanned 
the full range of peer evaluation scores. There was a positive relation- 
ship between formal position and peer evaluation scores of leadership 
potential (as there should be, if the original selection procedure for 
leaders had any validity). These data were experimental, and the intro- 
duction of an operational system might change the result. 

A rational solution to the boundary/hierarchical problem should be 
guided by the following suggestions: 

1. The group selected should be large enough to overcome problems 
associated with primary groups. 

2. The group should not be so large as to include subgroups who 
may be relatively unknown to each other or may be competing for 
similar resources and rewards. 

3. The function of the group selected should be reasonably related 
to the dimension to be evaluated; e.g., if evaluation of leader- 
ship in a work setting is desired, a work group and not a social 
group should be selected. 

Friendship

Friendship has been a major research issue in the history of peer
evaluations. According to folklore, peer evaluations are the product
of friendship or popularity and are therefore not valid indications of
the dimension under consideration. The impact of this bit of folklore
has been that, with the exception of simple validity studies, this is
probably the single most researched question associated with peer
evaluations.

Wherry and Fryer (1949) were the first to address the issue of
friendship in peer ratings. They reported that although there was a 
moderate degree of relationship between friendship and a leadership cri- 
terion, the major portion of the predicted criterion variance was inde- 
pendent of friendship. They concluded that peer evaluations of leader- 
ship are not popularity contests. Studies by Gibb (1950) and Horrocks 
and Wear (1953) in college samples supported Wherry and Fryer's findings.
Borgatta (1954) also reported that leadership and popularity evaluations 
were related, but he failed to draw any conclusions. Several other in- 
vestigations have documented a moderate degree of relationship between 
friendship and peer evaluations of leadership (Hollander, 1956; Hollander 
& Webb, 1955; Theodorson, 1957).

Downey (1974) presented evidence that the use of full nominations 
(with small numbers of high and low nominations required) reduced the 
correlation between friendship and leadership evaluations compared with 
forced distribution ratings. 

It seems that when an evaluator is faced with the task of evaluat- 
ing several people, some of whom he or she considers friends, the eval- 
uator will tend to select a friend rather than another person considered 
to be of equal, or at least indistinguishable, merit. Therefore, the 
variance associated with friendship may be a source of systematic error
primarily in the middle of the distribution. This systematic error var- 
iance will increase in large groups, in which some members are relatively 
unknown to each other or the interaction patterns are not fully estab- 
lished for all members. 

However, in spite of the impressive array of research findings as 
to the minimal effect of friendship, the "popularity contest" issue re- 
mains the argument most consistently offered against the use of peer 
evaluations in an operational setting. 

Length of Association

When peer evaluations are considered for use in any situation, an
important question is how long group members must be associated with 
each other before they can provide reliable and valid evaluations. This 
issue is often raised in the context of transient training groups. 
Research fairly consistently finds that peers can make reliable and
valid evaluations after a relatively short period of time, typically
3 to 6 weeks (Hollander, 1957).



Subsidiary to the overall issue is the effect of including a new 
group member in an intact group. Mayfield (1975) has suggested that in 
such a situation there may be reason to suspect that a longer period of 
acquaintanceship is necessary for sufficient integration into the group. 
A more generalized way of approaching the question is to determine which 
person is known or not well known to other members of the group. Evi- 
dence has shown that an individual not well known to other members of the 
group will typically be evaluated as near the middle of the distribution 
of peer evaluation scores within the group (Downey, 1974).

In terms of technique, a nomination procedure is most likely to
decrease the error variance associated with acquaintanceship; ratings or
rankings tend to capitalize on the error variance and show a greater
degree of relationship with acquaintanceship.
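
The tendency of a little-known member to fall near the middle of the
distribution under a nomination procedure can be sketched as follows.
The scoring rule used here (high nominations minus low nominations,
divided by the number of possible nominators) is an assumption made for
illustration and is not necessarily the formula used in the studies
cited; the group and its nominations are hypothetical.

```python
def nomination_scores(members, high, low):
    """Score each member from the high and low nominations received.

    high and low map each evaluator to the list of peers nominated.
    Assumed scoring rule: (high received - low received) / (group size - 1).
    """
    n_possible = len(members) - 1  # a member cannot nominate himself or herself
    scores = {}
    for person in members:
        n_high = sum(person in picks for rater, picks in high.items() if rater != person)
        n_low = sum(person in picks for rater, picks in low.items() if rater != person)
        scores[person] = (n_high - n_low) / n_possible
    return scores

# Hypothetical five-person group; "E" is a newcomer whom nobody nominates.
members = ["A", "B", "C", "D", "E"]
high = {"A": ["B"], "B": ["A"], "C": ["A"], "D": ["A"], "E": ["A"]}
low = {"A": ["D"], "B": ["D"], "C": ["D"], "D": ["C"], "E": ["D"]}

scores = nomination_scores(members, high, low)
ordered = sorted(members, key=lambda m: scores[m])  # lowest to highest
```

Because the newcomer receives neither high nor low nominations, the
score is zero and the newcomer lands at the median of the ordering
rather than at either extreme.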



Type of Interaction 

Although peer evaluations have been used and reported over a span 
of more than 25 years, they have been applied in rather limited situa- 
tions. Most of the research has been conducted with junior personnel in 
a military training context such as Officer Candidate School (OCS). A
recent effort to use a peer nomination process in a senior Army officer
promotion system produced supportive results (Downey, Medland, & Yates,
1976). Outside the military, Weitz (1958) and subsequently Mayfield
(1970; 1975) have worked in industry with insurance salesmen.

Freeberg (1969) reported a project in which peer evaluations were
more highly related to a performance criterion when the interaction
between peers was relevant to the dimension being evaluated. Bayroff and
Machlin (1950) found that leadership evaluations could be made in an 
academic environment and were highly related to evaluations made after 
exposure to a situation where leadership was displayed. Lewin, Dubno, 
and Akula (1971) indicated that video tapes supplied sufficient informa- 
tion for reliable evaluations and that these evaluations were highly re- 
lated to evaluations from group members. 

Until more extensive research is conducted in broader organiza- 
tional contexts with a wider selection of subject populations, the gen- 
erality of the peer evaluation process is largely a matter of conjec- 
ture. However, it would be safe to assume that peer evaluations of a 
variety of complex human behaviors can be rendered reliably after 
exposure of the peers to each other in situations that require the 
individual to interact either with the environment or with others in
relevant situations. Further, the validity of the evaluations will be
a function of the degree to which the particular behaviors are relevant
to the dimension under study. Hollander (1956) found that reliable
evaluations were given after 1 hour of discussion between peers in a
naval OCS class, but the scores had only a moderate relationship with
evaluations obtained 3 weeks later, and were even less predictive of
eventual job performance. This convergence of views by peers after a 
short period of exposure is probably a function of similar psychological 
maps of behavior on the part of peers, and the preliminary evaluations 
are subject to revision based upon further information. There seems
to be little advantage in using one evaluative technique over another,
so long as the technique does not require the evaluator to make finer 
discriminations than are possible, based on the type of interaction 
and the amount of information that can be gathered from the interaction. 



SUMMARY

Researchers have used the peer evaluation technique both as a criterion
of complex human behavior and as an index of future potential. The
particular dimension measured has varied considerably. The validity
research summarized presents an impressive array of findings, with
correlation coefficients in the .30 to .50 range in either a concurrent
or a predictive situation. Research on extending the generality of the
peer evaluation procedure to a more diverse sampling of peer group types,
particularly nontraining groups, has been limited.

The four major techniques have also demonstrated important similarities
and differences in their psychometric properties. For example,
only ratings can produce comparable scores across different groups
without extensive assumptions. Research results indicate little difference
in measurement reliability between techniques. The limited findings also
indicate that, in general, ratings and rankings are less acceptable than
either of the nomination techniques.

In view of the documented and likely effects of various situational
factors on the evaluation process, it is important that the researcher
be aware of potential problems in the use of peer evaluations. No direct
relationship was found between group size and the reliability or validity
of the evaluations, but it can be assumed that very small or very large
groups will produce less reliable and less valid scores. Group structure
and demographic characteristics were found to be sources of potential
difficulties. With respect to the popular issues of friendship,
acquaintanceship, and type of personal interaction, there is little
evidence that these have a major impact on the validity of the scores.
Indications are that all techniques are relatively impervious to a
variety of situational factors, the nomination technique being perhaps the
most versatile.

One possible adjustment in future work with this technique is to
begin referring to it as associate evaluation rather than peer
evaluation. The term peer evaluation, or more commonly peer rating, has
acquired overtones of meaning and often has a negative connotation among
those required to perform the evaluations. Moreover, the more generalized
rubric "associate evaluation" conceptually embraces more individuals;
the distinction should not be merely semantic.

In brief, peer evaluations, or associate evaluations, have been 
shown to be fruitful tools in both research and application. Several 
issues regarding their use remain to be resolved, but there is suffi- 
cient evidence to suggest that these issues can be resolved, and that 
they do not detract from the conclusion that associate evaluations are 
a very powerful tool for discriminating complex human behavior. 